Acoustic Emotion Classification
Optimizing Performance Through Ensemble Methods
An emotion classification project using machine learning to identify six emotions (neutral, happy, sad, angry, fear, disgust) from speech audio recordings, comparing different algorithms to determine the most effective approach for acoustic emotion recognition.
Author: Ralp Andrade
Affiliation: College of Information Science, University of Arizona
Abstract
This study investigates the capability of machine learning models to classify emotional states from acoustic features. Using the CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset), we focused on six target emotions: neutral, happy, sad, angry, fear, and disgust. Numerical audio features were extracted via librosa, standardized, and reduced using PCA to retain 98% of variance. We evaluated and compared a standalone algorithm (SVM), an ensemble method (Random Forest), and a neural network (MLP) for multi-class emotion recognition. Results indicate that the MLP achieved the highest macro F1-score (0.5534) and the best per-class F1-score on five of the six emotions, which suggests that it captures non-linear patterns in these features more effectively than the other two models.
Introduction
Automatic emotion classification from speech remains a challenging problem due to the subtlety and variability of acoustic cues. This project leverages the CREMA-D dataset, comprising 7,442 .wav clips from 91 actors portraying six basic emotions. Prior research demonstrates that features such as MFCCs and spectral properties are informative for emotion detection, yet there is no consensus on optimal feature selection or model architecture (Banerjee, Huang, & Lettiere, n.d.).
We transformed raw audio into quantitative features using librosa and applied a standardized preprocessing pipeline. The goal is to assess the effectiveness of both traditional and neural network-based models for multi-class emotion recognition, providing a comparative evaluation of standalone algorithms, ensemble methods, and neural networks.
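To make the idea of turning a waveform into numerical features concrete, here is a minimal NumPy-only sketch. It is a stand-in for illustration, not the librosa pipeline used in this project: the function name `basic_acoustic_features`, the three descriptors it returns, and the synthetic 440 Hz tone are all assumptions, and the actual columns of crema_d.csv may differ.

```python
import numpy as np


def basic_acoustic_features(signal: np.ndarray, sample_rate: int) -> dict:
    """Return three simple descriptors of a mono audio signal (illustrative only)."""
    # Root-mean-square energy: a proxy for the overall loudness of the clip
    rms = float(np.sqrt(np.mean(signal ** 2)))

    # Zero-crossing rate: fraction of adjacent sample pairs whose sign differs
    negative = np.signbit(signal)
    zcr = float(np.mean(negative[1:] != negative[:-1]))

    # Spectral centroid: magnitude-weighted mean frequency of the spectrum
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    centroid = float(np.sum(freqs * magnitude) / np.sum(magnitude))

    return {"rms": rms, "zcr": zcr, "spectral_centroid_hz": centroid}


# Synthetic stand-in for a speech clip: one second of a 440 Hz tone (assumption)
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

features = basic_acoustic_features(tone, sample_rate)
print(features)
```

Each clip would be reduced to one such row of numbers, and the rows stacked into the feature table that the models consume.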
Research Questions
Q1. What is the classification accuracy of supervised methods for emotion recognition from acoustic features?
Q2. How do standalone algorithms, ensemble approaches, and neural networks compare in terms of accuracy, robustness, and computational efficiency for emotion classification from audio?
Exploratory Analysis
Initial data load & Observations
    # Import libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import PowerTransformer, LabelEncoder, StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split

    # Import required libraries for models and metrics
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                                 precision_score, recall_score, f1_score)

    # Load the data into a DataFrame
    df = pd.read_csv("./data/crema_d.csv", index_col=0)

    # Number of observations and variables in the data
    print(f"Total observations: {df.shape[0]}")
    print(f"Number of features: {df.shape[1]}")

    # Numeric summary
    print(df.describe())

    # Missing values in each column
    missing_df = df.isna().sum()
    print("Missing values per column:\n", missing_df)
Target Class Distribution
We analyzed class distribution across the six emotions (neutral, happy, sad, angry, disgust, and fear). Our analysis showed that the “neutral” class has approximately 14.5% fewer samples than each of the other classes. Given this mild imbalance, no resampling was performed.
    plt.figure(figsize=(8, 6))
    sns.countplot(data=df, x='emotion', hue='emotion')
    plt.title("Target Counts")
    plt.xticks(rotation=45)
    plt.show()

    # Show class distribution for all emotions
    emotion_counts = df['emotion'].value_counts()
    #print("Class distribution:\n", emotion_counts)

    # Focus on the six target emotions
    target_emotions = ['happy', 'sad', 'angry', 'fear', 'neutral', 'disgust']
    target_counts = df[df['emotion'].isin(target_emotions)]['emotion'].value_counts()
    #print("\nTarget emotion counts:\n", target_counts)
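The size of the imbalance can be checked with a few lines of arithmetic. The sketch below assumes the per-class clip counts commonly reported for CREMA-D (1,271 clips for each of five emotions and 1,087 for neutral), which are consistent with the 7,442 total and the 14.5% figure quoted here. The hard-coded series is a hypothetical stand-in for `df['emotion'].value_counts()`.

```python
import pandas as pd

# Assumed class counts (stand-in for df['emotion'].value_counts())
emotion_counts = pd.Series(
    {"angry": 1271, "disgust": 1271, "fear": 1271,
     "happy": 1271, "sad": 1271, "neutral": 1087}
)

largest = emotion_counts.max()

# Percentage by which each class falls short of the largest class
shortfall_pct = (largest - emotion_counts) / largest * 100

# Ratio between the largest and the smallest class
imbalance_ratio = largest / emotion_counts.min()

print(shortfall_pct.round(1))
print(f"Imbalance ratio (largest / smallest): {imbalance_ratio:.2f}")
```

A ratio this close to 1 is usually considered mild, which is in line with the decision not to resample.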
Data Preprocessing
Data Cleaning & Transformation
Data preprocessing included checking for missing values, removing columns not used as acoustic features (actor_id, sentence, intensity, sample_rate), and transforming numerical features to reduce skewness. The Yeo–Johnson power transformation was applied to make the distributions more symmetric (target: absolute skewness below 0.5), improving suitability for downstream modeling. The categorical target variable (‘emotion’) was encoded into numerical labels for compatibility with machine learning algorithms.
    # Remove unwanted columns
    df = df.drop(columns=['actor_id', 'sentence', 'intensity', 'sample_rate'])

    # Identify skew of numeric columns
    num_cols = df.select_dtypes(include='number').columns
    #print(num_cols)
    skewness_before = df[num_cols].skew().sort_values(ascending=False)
    #print("Before:", skewness_before)

    # Yeo-Johnson power transformer
    yeojt = PowerTransformer(method='yeo-johnson', standardize=True)
    df_transformed = yeojt.fit_transform(df[num_cols])

    # Create DataFrame from transformed data with the correct columns;
    # reuse df's index so the assignment below aligns row by row
    df_t = pd.DataFrame(df_transformed, columns=num_cols, index=df.index)

    # Overwrite original columns in df with transformed values
    df[num_cols] = df_t

    # Get skewness after transformation
    skewness_after = df[num_cols].skew().sort_values(ascending=False)
    #print("After:", skewness_after)

    # Combined before/after skewness chart
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

    # Before transformation
    sns.barplot(x=skewness_before.values, y=skewness_before.index, ax=ax1, color='lightcoral')
    ax1.set_title("Skewness Before Yeo-Johnson", fontsize=14)
    ax1.axvline(x=0, color='red', linestyle='--', alpha=0.7)
    ax1.set_xlabel("Skewness", fontsize=12)

    # After transformation
    sns.barplot(x=skewness_after.values, y=skewness_after.index, ax=ax2, color='lightblue')
    ax2.set_title("Skewness After Yeo-Johnson", fontsize=14)
    ax2.axvline(x=0, color='red', linestyle='--', alpha=0.7)
    ax2.set_xlabel("Skewness", fontsize=12)

    # Adjust layout
    plt.tight_layout()
    plt.show()

    # Encode target variable
    le = LabelEncoder()
    df['encoded_emotion'] = le.fit_transform(df['emotion'])

    # View mapping
    #print(dict(zip(le.classes_, le.transform(le.classes_))))
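The effect of the Yeo–Johnson transformation on a skewed feature can be seen on synthetic data. The sketch below is an illustration under stated assumptions: the log-normal column named `feature` is invented, and the transformer is fitted on the first 1,600 rows only. That is a variation on the code in this section, which fits on the full table before splitting.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (assumption: stands in for one acoustic column)
rng = np.random.default_rng(42)
skewed = pd.DataFrame({"feature": rng.lognormal(mean=0.0, sigma=0.8, size=2000)})
train, test = skewed.iloc[:1600], skewed.iloc[1600:]

# Estimate the Yeo-Johnson lambda on the training rows, then reuse it on the rest
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
train_t = transformer.fit_transform(train)
test_t = transformer.transform(test)

skew_before = float(train["feature"].skew())
skew_after = float(pd.Series(train_t[:, 0]).skew())

print(f"Fitted lambda:   {transformer.lambdas_[0]:.3f}")
print(f"Skewness before: {skew_before:.3f}")
print(f"Skewness after:  {skew_after:.3f}")
```

Fitting on the training rows alone keeps information from held-out rows out of the estimated parameters. Whether that changes the reported scores here has not been tested.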
Splitting the dataset
The data is split using the train_test_split function, with 20% of the data reserved for testing and stratification on the emotion label so that both partitions keep the same class proportions. For the features (X), we drop the target column and its encoded variant, while for the labels (y), we retain only the encoded emotion.
    # Train/test split for target emotions
    X = df.drop(columns=['emotion', 'encoded_emotion'])
    y = df['encoded_emotion']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
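What `stratify=y` buys can be verified by comparing class proportions before and after the split. This sketch uses assumed CREMA-D class counts (1,271 for five emotions, 1,087 for neutral) and a single placeholder column named `dummy_feature`. Both are illustrative rather than taken from the project's data file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed class counts; labels are built by repeating each name count times
counts = {"angry": 1271, "disgust": 1271, "fear": 1271,
          "happy": 1271, "sad": 1271, "neutral": 1087}
y = pd.Series(np.repeat(list(counts.keys()), list(counts.values())), name="emotion")
X = pd.DataFrame({"dummy_feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class proportions in the full data and in each partition
proportions = pd.DataFrame({
    "full": y.value_counts(normalize=True),
    "train": y_train.value_counts(normalize=True),
    "test": y_test.value_counts(normalize=True),
})
print(proportions.round(4))

# Largest deviation of any partition from the full-data proportions
max_gap = float(
    proportions[["train", "test"]].sub(proportions["full"], axis=0).abs().max().max()
)
print(f"Largest proportion gap: {max_gap:.4f}")
```

With stratification the gap stays at the level of a single sample per class, so the smaller neutral class is represented equally in training and testing.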
Data Scaling and Dimension Reduction
Before training our models, we applied data scaling and dimensionality reduction to the numeric features. First, we identified all numeric columns in the dataset and applied Standard Scaling to ensure each feature has zero mean and unit variance.
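Next, we performed Principal Component Analysis (PCA) to reduce dimensionality while retaining 98% of the variance. PCA transforms the scaled features into a set of orthogonal components, capturing most of the information in fewer dimensions. We evaluated the explained variance ratio for each component and visualized it using a scree plot to confirm the number of components selected.

    # Select all numeric columns
    num_cols = X_train.select_dtypes(include='number').columns

    # Fit the standard scaler on the training data, then apply it to both sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train[num_cols])
    X_test_scaled = scaler.transform(X_test[num_cols])

    # Apply PCA while retaining 98% of the variance
    pca = PCA(n_components=.98)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    # Explained variance ratio
    explained_variance = pca.explained_variance_ratio_

    # Create scree plot
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, len(explained_variance) + 1), explained_variance, 'o-', linewidth=2, color='blue')
    plt.title('Scree Plot')
    plt.xlabel('Principal Component')
    plt.ylabel('Variance Explained')
    plt.xticks(range(1, len(explained_variance) + 1))
    plt.grid(True)
    plt.show()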
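As a compact illustration of the scaling and PCA step, the sketch below chains `StandardScaler` and `PCA(n_components=0.98)` in a scikit-learn pipeline and checks how many components are kept. The data are synthetic (20 correlated columns generated from 5 latent factors), so the component count it prints says nothing about the real feature table.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features: 20 observed columns driven by 5 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 20))
X_train, X_test = X[:400], X[400:]

# A float n_components asks PCA to keep the smallest number of components
# whose cumulative explained variance reaches that fraction
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.98))
Z_train = reducer.fit_transform(X_train)   # fit on training rows only
Z_test = reducer.transform(X_test)         # reuse the fitted scaler and PCA

pca = reducer.named_steps["pca"]
cumulative = np.cumsum(pca.explained_variance_ratio_)

print(f"Components kept: {pca.n_components_} of {X.shape[1]}")
print(f"Cumulative variance explained: {cumulative[-1]:.4f}")
```

Wrapping both steps in one pipeline is a design choice rather than a requirement. It guarantees that the test data are only ever transformed with parameters learned from the training data.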
Model Training and Evaluation
To address our research questions on emotion recognition from acoustic features, we trained and evaluated multiple machine learning models, including standalone algorithms, ensemble methods, and neural networks. The models were trained on the PCA-reduced and standardized feature set to improve convergence, reduce dimensionality, and mitigate potential overfitting.
For each model, we used the evaluate_model function, which trains the model and reports comprehensive performance metrics. These metrics include overall accuracy, macro-averaged precision, recall, and F1-score, as well as per-class performance for each emotion in the target set. To provide a visual assessment of prediction quality, confusion matrices were generated for all models.
Specifically, we evaluated:
- Random Forest (RF): An ensemble method configured with 1000 trees, maximum depth of 15, and out-of-bag scoring to provide robust predictions while controlling for overfitting.
- Support Vector Machine (SVM): A kernel-based model with RBF kernel, class-weight balancing for mild class imbalance, and probability estimates enabled.
- Multilayer Perceptron (MLP): A neural network with three hidden layers (128, 64, 32 neurons), early stopping, and a 10% validation split to monitor convergence.
    # y_train is used as-is: it already contains only the target emotions
    # Get emotion labels for reporting
    emotion_labels = [emotion for emotion in le.classes_ if emotion in target_emotions]
    target_label_mapping = {le.transform([emotion])[0]: emotion for emotion in emotion_labels}
    print("Target emotion label mapping:", target_label_mapping)


    def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
        """Train and evaluate a model with comprehensive metrics."""
        print(f"Training {model_name}")
        print(f"{'='*50}")

        # Train model
        model.fit(X_train, y_train)

        # Make predictions
        y_pred = model.predict(X_test)

        # Overall metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision_macro = precision_score(y_test, y_pred, average='macro', zero_division=0)
        recall_macro = recall_score(y_test, y_pred, average='macro', zero_division=0)
        f1_macro = f1_score(y_test, y_pred, average='macro', zero_division=0)

        print(f"\n{model_name} Overall Performance:")
        print(f"Accuracy: {accuracy:.4f}")
        print(f"Macro Precision: {precision_macro:.4f}")
        print(f"Macro Recall: {recall_macro:.4f}")
        print(f"Macro F1-Score: {f1_macro:.4f}")

        # Per-class metrics
        print(f"\n{model_name} Per-Class Performance:")
        target_names = [target_label_mapping.get(i, f'Class_{i}') for i in sorted(np.unique(y_test))]
        report = classification_report(y_test, y_pred, target_names=target_names,
                                       output_dict=True, zero_division=0)

        # Display per-class metrics in a formatted way
        print(f"{'Emotion':<10}{'Precision':<10}{'Recall':<10}{'F1-Score':<10}{'Support':<10}")
        print("-" * 50)
        for emotion in target_names:
            if emotion in report:
                metrics = report[emotion]
                print(f"{emotion:<10}{metrics['precision']:<10.4f}{metrics['recall']:<10.4f} "
                      f"{metrics['f1-score']:<10.4f}{int(metrics['support']):<10}")

        # Confusion matrix
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                    xticklabels=target_names, yticklabels=target_names)
        plt.title(f'{model_name} - Confusion Matrix')
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.show()

        # Return results for comparison
        return {
            'model_name': model_name,
            'accuracy': accuracy,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'f1_macro': f1_macro,
            'per_class_report': report
        }


    # Initialize models
    # Random forest
    rf_model = RandomForestClassifier(
        n_estimators=1000,        # More trees for better performance
        max_depth=15,             # Prevent overfitting while allowing complexity
        min_samples_split=5,      # Require more samples to split
        min_samples_leaf=2,       # Minimum samples in leaf nodes
        max_features='sqrt',      # Feature sampling at each split
        bootstrap=True,           # Bootstrap sampling
        oob_score=True,           # Out-of-bag scoring
        n_jobs=-1,                # Use all available cores
        random_state=42
    )

    # Support vector machine
    svm_model = SVC(
        kernel='rbf',             # RBF kernel works well for most cases
        C=10.0,                   # Regularization parameter
        gamma='scale',            # Kernel coefficient (auto-scaled)
        probability=True,         # Enable probability estimates
        cache_size=1000,          # Increase cache for faster training
        class_weight='balanced',  # Handle imbalanced datasets
        random_state=42
    )

    # Standard neural network
    mlp_model = MLPClassifier(
        hidden_layer_sizes=(128, 64, 32),  # Three hidden layers
        max_iter=1000,                     # Maximum iterations
        early_stopping=True,               # Stop when validation score stops improving
        validation_fraction=0.1,           # Validation set
        random_state=42
    )

    # Evaluate ensemble models along with standalone models
    results = []

    # Random Forest
    rf_results = evaluate_model(rf_model, X_train_pca, X_test_pca, y_train, y_test, "Random Forest")
    results.append(rf_results)

    # SVM
    svm_results = evaluate_model(svm_model, X_train_pca, X_test_pca, y_train, y_test, "SVM")
    results.append(svm_results)

    # MLP
    mlp_results = evaluate_model(mlp_model, X_train_pca, X_test_pca, y_train, y_test, "MLP(Neural Net)")
    results.append(mlp_results)
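A single train/test split gives one score per model. One possible way to also gauge robustness and computational cost, which the second research question mentions, is stratified k-fold cross-validation with timing. The sketch below is not part of the original analysis: it runs on synthetic six-class data from `make_classification`, uses smaller model settings than those above so that it finishes quickly, and the `summary` dictionary is a hypothetical helper structure.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic six-class problem (assumption: stands in for the PCA-reduced features)
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_redundant=2,
    n_classes=6, n_clusters_per_class=1, random_state=42,
)

# Reduced-size versions of the three model families, for a quick run
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42),
    "SVM": make_pipeline(StandardScaler(),
                         SVC(kernel="rbf", C=10.0, class_weight="balanced")),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                                       random_state=42)),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
summary = {}

for name, model in candidates.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    elapsed = time.perf_counter() - start
    summary[name] = {"mean_f1": scores.mean(), "std_f1": scores.std(), "seconds": elapsed}
    print(f"{name:<15} macro F1 = {scores.mean():.3f} +/- {scores.std():.3f}  ({elapsed:.1f} s)")
```

The standard deviation across folds gives a rough sense of how stable each model's macro F1 is, and the elapsed time is a coarse proxy for training cost on the machine at hand.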
Model Comparison
We compared the performance of three model types, namely Random Forest, SVM, and a multilayer perceptron (MLP), on the emotion recognition task using macro-averaged metrics and per-emotion performance. On every overall metric, the MLP outperformed both Random Forest and SVM, achieving the highest accuracy (0.5567), macro precision (0.5538), macro recall (0.5574), and macro F1-score (0.5534). This suggests that, among the three models, the neural network is the most effective at capturing complex patterns in the PCA-reduced acoustic feature space.
Per-emotion analysis revealed nuanced differences among models. For Angry, all models performed relatively well, with the MLP achieving the highest F1-score (0.7374), although Random Forest had the highest recall (0.7835). Disgust and Fear were the most challenging emotions: no model reached an F1-score of 0.50 on either. The MLP led on Fear (0.4883), whereas SVM was slightly ahead on Disgust (0.4941 versus 0.4820 for the MLP). For Happy and Neutral, the MLP again showed the highest F1-scores (0.5011 and 0.5259), helped for Neutral by the highest recall of the three models (0.5826). Sad classification also favored the MLP, which showed balanced precision and recall (F1-score 0.5857), even though Random Forest reached a higher recall (0.6024).
Overall, while ensemble methods like Random Forest and kernel-based SVM provide competitive performance for certain classes, the MLP’s ability to model non-linear interactions across multiple dimensions makes it the best-performing approach in this task. These results highlight that neural network-based models may be better suited for emotion recognition from acoustic features, addressing both classification accuracy and balanced performance across all emotions.
    print("MODEL COMPARISON SUMMARY")
    print(f"{'='*50}")

    # Create comparison DataFrame
    comparison_df = pd.DataFrame({
        'Model': [result['model_name'] for result in results],
        'Accuracy': [result['accuracy'] for result in results],
        'Precision (Macro)': [result['precision_macro'] for result in results],
        'Recall (Macro)': [result['recall_macro'] for result in results],
        'F1-Score (Macro)': [result['f1_macro'] for result in results]
    })
    print(comparison_df.round(4))

    # Visualize model comparison
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    metrics = ['Accuracy', 'Precision (Macro)', 'Recall (Macro)', 'F1-Score (Macro)']
    colors = ['skyblue', 'lightcoral', 'lightgreen']

    for i, metric in enumerate(metrics):
        row = i // 2
        col = i % 2
        ax = axes[row, col]
        bars = ax.bar(comparison_df['Model'], comparison_df[metric], color=colors)
        ax.set_title(f'{metric} Comparison')
        ax.set_ylabel(metric)
        ax.set_ylim(0, 1)

        # Add value labels on bars
        for bar in bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width() / 2., height + 0.01,
                    f'{height:.3f}', ha='center', va='bottom')

    plt.tight_layout()
    plt.show()

    # Best performing model (idxmax returns an index label, so use .loc)
    best_model_idx = comparison_df['F1-Score (Macro)'].idxmax()
    best_model = comparison_df.loc[best_model_idx]
    print(f"\nBest Performing Model: {best_model['Model']}")
    print(f"F1-Score (Macro): {best_model['F1-Score (Macro)']:.4f}")

    # Detailed per-emotion analysis across all models
    print("PER-EMOTION PERFORMANCE ACROSS MODELS")
    print(f"{'='*50}")

    # Get target emotion names from the unique test labels
    target_names = [target_label_mapping.get(i, f'Class_{i}') for i in sorted(np.unique(y_test))]

    emotion_performance = {}
    for emotion in target_names:
        emotion_performance[emotion] = {}
        for result in results:
            if emotion in result['per_class_report']:
                emotion_performance[emotion][result['model_name']] = {
                    'precision': result['per_class_report'][emotion]['precision'],
                    'recall': result['per_class_report'][emotion]['recall'],
                    'f1_score': result['per_class_report'][emotion]['f1-score']
                }

    # Create detailed emotion analysis
    for emotion in target_names:
        print(f"\n{emotion.upper()} Performance:")
        print(f"{'Model':<15}{'Precision':<10}{'Recall':<10}{'F1-Score':<10}")
        print("-" * 50)
        for model_name, metrics in emotion_performance[emotion].items():
            print(f"{model_name:<15}{metrics['precision']:<10.4f}{metrics['recall']:<10.4f}{metrics['f1_score']:<10.4f}")
    Best Performing Model: MLP(Neural Net)
    F1-Score (Macro): 0.5534

    PER-EMOTION PERFORMANCE ACROSS MODELS
    ==================================================

    ANGRY Performance:
    Model            Precision  Recall     F1-Score
    --------------------------------------------------
    Random Forest    0.6067     0.7835     0.6838
    SVM              0.6541     0.7520     0.6996
    MLP(Neural Net)  0.6996     0.7795     0.7374

    DISGUST Performance:
    Model            Precision  Recall     F1-Score
    --------------------------------------------------
    Random Forest    0.4744     0.4370     0.4549
    SVM              0.4960     0.4921     0.4941
    MLP(Neural Net)  0.5205     0.4488     0.4820

    FEAR Performance:
    Model            Precision  Recall     F1-Score
    --------------------------------------------------
    Random Forest    0.4834     0.2874     0.3605
    SVM              0.4478     0.4724     0.4598
    MLP(Neural Net)  0.4845     0.4921     0.4883

    HAPPY Performance:
    Model            Precision  Recall     F1-Score
    --------------------------------------------------
    Random Forest    0.4938     0.4667     0.4798
    SVM              0.5152     0.4667     0.4897
    MLP(Neural Net)  0.5463     0.4627     0.5011

    NEUTRAL Performance:
    Model            Precision  Recall     F1-Score
    --------------------------------------------------
    Random Forest    0.4519     0.4954     0.4726
    SVM              0.4846     0.5046     0.4944
    MLP(Neural Net)  0.4792     0.5826     0.5259

    SAD Performance:
    Model            Precision  Recall     F1-Score
    --------------------------------------------------
    Random Forest    0.5169     0.6024     0.5564
    SVM              0.5571     0.4803     0.5159
    MLP(Neural Net)  0.5927     0.5787     0.5857
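The macro F1-score is the unweighted mean of the per-class F1-scores, so the headline figure can be recovered from the per-emotion output. The sketch below copies the F1 column from that output into a dictionary and recomputes the average for each model. Because the inputs are rounded to four decimals, the Random Forest and SVM values are approximations, and the `per_class_f1` structure itself is only an illustrative helper.

```python
# Per-class F1-scores copied from the per-emotion output
per_class_f1 = {
    "Random Forest": {"angry": 0.6838, "disgust": 0.4549, "fear": 0.3605,
                      "happy": 0.4798, "neutral": 0.4726, "sad": 0.5564},
    "SVM": {"angry": 0.6996, "disgust": 0.4941, "fear": 0.4598,
            "happy": 0.4897, "neutral": 0.4944, "sad": 0.5159},
    "MLP(Neural Net)": {"angry": 0.7374, "disgust": 0.4820, "fear": 0.4883,
                        "happy": 0.5011, "neutral": 0.5259, "sad": 0.5857},
}

# Macro F1: every emotion counts equally, regardless of its number of samples
macro_f1 = {
    model: sum(scores.values()) / len(scores)
    for model, scores in per_class_f1.items()
}

# Which model has the highest F1-score for each emotion
emotions = list(per_class_f1["SVM"])
best_per_emotion = {
    emotion: max(per_class_f1, key=lambda m: per_class_f1[m][emotion])
    for emotion in emotions
}

for model, value in macro_f1.items():
    print(f"{model:<16} macro F1 = {value:.4f}")
print(best_per_emotion)
```

The MLP average reproduces the reported 0.5534, and the per-emotion winners show the MLP leading on five emotions with SVM ahead on Disgust.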
Conclusion
This study investigated the capability of machine learning models to classify emotional states from acoustic features using the CREMA-D dataset. Among Random Forest, SVM, and a multilayer perceptron, the MLP achieved the best overall accuracy and macro-averaged scores and the highest F1-score on five of the six emotions, which points to its ability to capture complex nonlinear patterns in the PCA-reduced feature space. Traditional ensemble and kernel-based methods provided competitive results for certain emotions, with SVM slightly ahead on Disgust and Random Forest reaching the highest recall on Angry and Sad, but the neural network offered the most balanced performance across classes, particularly for challenging emotions such as Fear and Neutral.
These findings suggest that, for multi-class emotion recognition from audio, neural networks may be better suited to leverage nuanced acoustic features than the standalone or ensemble methods tested here. Future work could explore integrating temporal modeling with recurrent neural networks or transformer-based architectures, combining audio and visual modalities, or experimenting with advanced feature extraction methods to further improve classification accuracy and robustness.
Source Code
---title: "Acoustic Emotion Classification"subtitle: "Optimizing Performance Through Ensemble Methods"author: - name: "Ralp Andrade" affiliations: - name: "College of Information Science, University of Arizona"description: "An emotion classification project using machine learning to identify four emotions (happy, sad, angry, fear) from speech audio recordings, comparing different algorithms to determine the most effective approach for acoustic emotion recognition."format: html: code-tools: true code-overflow: wrap embed-resources: trueeditor: visualexecute: warning: false echo: falsejupyter: python3---## AbstractThis study investigates the capability of machine learning models to classify emotional states from acoustic features. Using the CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset), we focused on six target emotions: neutral, happy, sad, angry, fear, and disgust. Numerical audio features were extracted via librosa, standardized, and reduced using PCA to retain 98% of variance. We evaluated and compared standalone algorithms (SVM), ensemble methods (Random Forest), and neural networks (MLP) for multi-class emotion recognition. Results indicate that the MLP achieved the highest macro F1-score (0.5534), demonstrating superior ability to capture non-linear patterns and balanced performance across all emotions.## IntroductionAutomatic emotion classification from speech remains a challenging problem due to the subtlety and variability of acoustic cues. This project leverages the CREMA-D dataset, comprising 7,442 .wav clips from 91 actors portraying six basic emotions. Prior research demonstrates that features such as MFCCs and spectral properties are informative for emotion detection, yet there is no consensus on optimal feature selection or model architecture (Banerjee, Huang, & Lettiere, n.d.).We transformed raw audio into quantitative features using librosa and applied a standardized preprocessing pipeline. 
The goal is to assess the effectiveness of both traditional and neural network-based models for multi-class emotion recognition, providing a comparative evaluation of standalone algorithms, ensemble methods, and neural networks.## Research Question- **Q1. What is the classification accuracy of unsupervised methods for emotion recognition from acoustic features?**- **Q2. How do standalone algorithms, ensemble approaches, and neural networks compare in terms of accuracy, robustness, and computational efficiency for emotion classification from audio?**## Exploratory Analysis### Inital data load & Observations```{python}#| label: dataset#| echo: true# Import libsimport numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport seaborn as snsfrom sklearn.preprocessing import PowerTransformer, LabelEncoder, StandardScalerfrom sklearn.decomposition import PCAfrom sklearn.model_selection import train_test_split# Import required libraries for models and metricsfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.svm import SVCfrom sklearn.neural_network import MLPClassifierfrom sklearn.metrics import (classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score)#import data into dfdf = pd.read_csv("./data/crema_d.csv", index_col=0)# number of variables and observations in the dataprint(f"Total observations: {df.shape[0]}")print(f"Number of features: {df.shape[1]}")# Numeric summaryprint(df.describe())# missing values in each columnmissing_df = df.isna().sum()print("Missing values per column:\n", missing_df)```### Target Class DistributionWe analyzed class distribution across the six emotions (neutral, happy, sad, angry, disgust, and fear). Our analysis showed that the "neutral" class has approximately 14.5% fewer samples than the other classes. 
Given this mild imbalance, no resampling was performed.```{python}#| warning: false#| message: false#| echo: trueplt.figure(figsize=(8, 6))sns.countplot(data=df, x='emotion', hue='emotion')plt.title("Target Counts")plt.xticks(rotation=45)plt.show()# Show class distribution for all emotionsemotion_counts = df['emotion'].value_counts()#print("Class distribution:\n", emotion_counts)# Focus on the four target emotionstarget_emotions = ['happy', 'sad', 'angry', 'fear', 'neutral','disgust']target_counts = df[df['emotion'].isin(target_emotions)]['emotion'].value_counts()#print("\nTarget emotion counts:\n", target_counts)```## Data Preprocessing### Data Cleaning & TransformationData pre-processing included handling missing values, removing irrelevant columns, and transforming numerical features to reduce skewness. The Yeo–Johnson power transformation was applied to achieve more symmetric distributions (skewness \< ±0.5), improving suitability for downstream modeling. The categorical target variable (‘emotion’) was encoded into numerical labels for compatibility with machine learning algorithms.```{python}#| warning: false#| message: false#| echo: true# Remove unwanted columnsdf = df.drop(columns=['actor_id', 'sentence', 'intensity','sample_rate'])#Indentify skew of numeric colsnum_cols = df.select_dtypes(include='number').columns#print(num_cols)skewness_before = df[num_cols].skew().sort_values(ascending=False)#print("Before:", skewness_before)# yeo-johnson power transformeryeojt = PowerTransformer(method='yeo-johnson', standardize=True)df_transformed = yeojt.fit_transform(df[num_cols])# Create DataFrame from transformed data with correct columnsdf_t = pd.DataFrame(df_transformed, columns=num_cols)# Overwrite original columns in df with transformed valuesdf[num_cols] = df_t# Get skewness after transformationskewness_after = df[num_cols].skew().sort_values(ascending=False)#print("After:", skewness_after)# Combined before/after skewness chartfig, (ax1, ax2) = plt.subplots(1, 
2, figsize=(16, 6))# Before transformationsns.barplot(x=skewness_before.values, y=skewness_before.index, ax=ax1, color='lightcoral')ax1.set_title("Skewness Before Yeo-Johnson", fontsize=14)ax1.axvline(x=0, color='red', linestyle='--', alpha=0.7)ax1.set_xlabel("Skewness", fontsize=12)# After transformationsns.barplot(x=skewness_after.values, y=skewness_after.index, ax=ax2, color='lightblue')ax2.set_title("Skewness After Yeo-Johnson", fontsize=14)ax2.axvline(x=0, color='red', linestyle='--', alpha=0.7)ax2.set_xlabel("Skewness", fontsize=12)# Adjust layoutplt.tight_layout()plt.show()# encode target variablele = LabelEncoder()df['encoded_emotion'] = le.fit_transform(df['emotion'])# View mapping#print(dict(zip(le.classes_, le.transform(le.classes_))))```### Splitting the datasetThe data is split using the train_test_split function, with 20% of the data reserved for testing. For the features (X), we drop the target column and its encoded variant, while for the labels (y), we retain only the encoded emotion.```{python}#| warning: false#| message: false#| echo: true# train/test split for target emotionsX = df.drop(columns=['emotion','encoded_emotion'])y = df['encoded_emotion']X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)```### Data Scaling and Dimension ReductionBefore training our models, we applied data scaling and dimensionality reduction to the numeric features. First, we identified all numeric columns in the dataset and applied Standard Scaling to ensure each feature has zero mean and unit variance.Next, we performed Principal Component Analysis (PCA) to reduce dimensionality while retaining 98% of the variance. PCA transforms the scaled features into a set of orthogonal components, capturing most of the information in fewer dimensions. 
We evaluated the explained variance ratio for each component and visualized it using a scree plot to confirm the number of components selected.```{python}#| warning: false#| message: false#| echo: true# select all numeric columnsnum_cols = X_train.select_dtypes(include='number').columns# Apply standard scaler on numeric datascaler = StandardScaler()X_train_scaled = scaler.fit_transform(X_train[num_cols])X_test_scaled = scaler.transform(X_test[num_cols])# Apply PCA while retaining 90pca = PCA(n_components=.98)X_train_pca = pca.fit_transform(X_train_scaled)X_test_pca = pca.transform(X_test_scaled)# Explained variance ratioexplained_variance = pca.explained_variance_ratio_# Create scree plotplt.figure(figsize=(8, 5))plt.plot(range(1, len(explained_variance)+1), explained_variance, 'o-', linewidth=2, color='blue')plt.title('Scree Plot')plt.xlabel('Principal Component')plt.ylabel('Variance Explained')plt.xticks(range(1, len(explained_variance)+1))plt.grid(True)plt.show()```## Model Training and EvaluationTo address our research questions on emotion recognition from acoustic features, we trained and evaluated multiple machine learning models, including standalone algorithms, ensemble methods, and neural networks. The models were trained on the PCA-reduced and standardized feature set to improve convergence, reduce dimensionality, and mitigate potential overfitting.For each model, we used the evaluate_model function, which trains the model and reports comprehensive performance metrics. These metrics include overall accuracy, macro-averaged precision, recall, and F1-score, as well as per-class performance for each emotion in the target set. 
To provide a visual assessment of prediction quality, confusion matrices were generated for all models.Specifically, we evaluated: - **Random Forest (RF):** An ensemble method configured with 1000 trees, maximum depth of 15, and out-of-bag scoring to provide robust predictions while controlling for overfitting.- **Support Vector Machine (SVM):** A kernel-based model with RBF kernel, class-weight balancing for mild class imbalance, and probability estimates enabled.- **Multilayer Perceptron (MLP):** A neural network with three hidden layers (128, 64, 32 neurons), early stopping, and a 10% validation split to monitor convergence.```{python}#| warning: false#| message: false#| echo: true# Skip extending y_train - we want to exclude omitted emotions from predictions# Use original y_train that only contains target emotions# Get emotion labels for reportingemotion_labels = [emotion for emotion in le.classes_ if emotion in target_emotions]target_label_mapping = {le.transform([emotion])[0]: emotion for emotion in emotion_labels}print("Target emotion label mapping:", target_label_mapping)def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):""" Train and evaluate a model with comprehensive metrics """print(f"Training {model_name}")print(f"{'='*50}")# Train model model.fit(X_train, y_train)# Make predictions y_pred = model.predict(X_test)# Overall metrics accuracy = accuracy_score(y_test, y_pred) precision_macro = precision_score(y_test, y_pred, average='macro', zero_division=0) recall_macro = recall_score(y_test, y_pred, average='macro', zero_division=0) f1_macro = f1_score(y_test, y_pred, average='macro', zero_division=0)print(f"\n{model_name} Overall Performance:")print(f"Accuracy: {accuracy:.4f}")print(f"Macro Precision: {precision_macro:.4f}")print(f"Macro Recall: {recall_macro:.4f}")print(f"Macro F1-Score: {f1_macro:.4f}")# Per-class metricsprint(f"\n{model_name} Per-Class Performance:") target_names = [target_label_mapping.get(i, f'Class_{i}') for i 
                    in sorted(np.unique(y_test))]
    report = classification_report(y_test, y_pred, target_names=target_names,
                                   output_dict=True, zero_division=0)

    # Display per-class metrics in a formatted way
    print(f"{'Emotion':<10}{'Precision':<10}{'Recall':<10}{'F1-Score':<10}{'Support':<10}")
    print("-" * 50)
    for emotion in target_names:
        if emotion in report:
            metrics = report[emotion]
            print(f"{emotion:<10}{metrics['precision']:<10.4f}{metrics['recall']:<10.4f}"
                  f"{metrics['f1-score']:<10.4f}{int(metrics['support']):<10}")

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=target_names, yticklabels=target_names)
    plt.title(f'{model_name} - Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

    # Return results for comparison
    return {
        'model_name': model_name,
        'accuracy': accuracy,
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'f1_macro': f1_macro,
        'per_class_report': report
    }

# Initialize models
# Random forest
rf_model = RandomForestClassifier(
    n_estimators=1000,        # More trees for better performance
    max_depth=15,             # Prevent overfitting while allowing complexity
    min_samples_split=5,      # Require more samples to split
    min_samples_leaf=2,       # Minimum samples in leaf nodes
    max_features='sqrt',      # Feature sampling at each split
    bootstrap=True,           # Bootstrap sampling
    oob_score=True,           # Out-of-bag scoring
    n_jobs=-1,                # Use all available cores
    random_state=42
)

# Support vector machine
svm_model = SVC(
    kernel='rbf',             # RBF kernel works well for most cases
    C=10.0,                   # Regularization parameter
    gamma='scale',            # Kernel coefficient (auto-scaled)
    probability=True,         # Enable probability estimates
    cache_size=1000,          # Increase cache for faster training
    class_weight='balanced',  # Handle imbalanced datasets
    random_state=42
)

# Standard neural network
mlp_model = MLPClassifier(
    hidden_layer_sizes=(128, 64, 32),  # Three hidden layers
    max_iter=1000,                     # Maximum iterations
    early_stopping=True,               # Stop when validation score stops improving
    validation_fraction=0.1,           # Fraction of training data held out for validation
    random_state=42
)

# Evaluate the ensemble model along with the standalone models
results = []

# Random Forest
rf_results = evaluate_model(rf_model, X_train_pca, X_test_pca, y_train, y_test, "Random Forest")
results.append(rf_results)

# SVM
svm_results = evaluate_model(svm_model, X_train_pca, X_test_pca, y_train, y_test, "SVM")
results.append(svm_results)

# MLP
mlp_results = evaluate_model(mlp_model, X_train_pca, X_test_pca, y_train, y_test, "MLP(Neural Net)")
results.append(mlp_results)
```

## Model Comparison

We compared three model types (Random Forest, SVM, and a multilayer perceptron, or MLP) on the emotion recognition task using macro-averaged metrics and per-emotion performance. On every overall metric, the MLP outperformed both Random Forest and SVM, achieving the highest accuracy (0.5567), macro precision (0.5538), macro recall (0.5574), and macro F1-score (0.5534). This suggests that, of the three, the neural network is the most effective at capturing complex patterns in the PCA-reduced acoustic feature space.

Per-emotion analysis revealed more nuanced differences among the models. For Angry, all models performed relatively well, with the MLP achieving the highest F1-score (0.7374). Disgust and Fear were more challenging, with lower F1-scores overall, though the MLP slightly improved on the other models. For Happy and Neutral, the MLP again showed superior F1-scores, notably improving recall for the Neutral class (0.5826). Sad classification also favored the MLP, which showed balanced precision and recall (F1-score 0.5857).

Overall, while ensemble methods such as Random Forest and the kernel-based SVM provide competitive performance for certain classes, the MLP’s ability to model non-linear interactions across multiple dimensions makes it the best-performing approach in this task. These results suggest that neural network-based models may be better suited for emotion recognition from acoustic features, in terms of both classification accuracy and balanced performance across all emotions.

```{python}
#| warning: false
#| message: false
#| echo: true
print("MODEL COMPARISON SUMMARY")
print(f"{'='*50}")

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Model': [result['model_name'] for result in results],
    'Accuracy': [result['accuracy'] for result in results],
    'Precision (Macro)': [result['precision_macro'] for result in results],
    'Recall (Macro)': [result['recall_macro'] for result in results],
    'F1-Score (Macro)': [result['f1_macro'] for result in results]
})
print(comparison_df.round(4))

# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
metrics = ['Accuracy', 'Precision (Macro)', 'Recall (Macro)', 'F1-Score (Macro)']
colors = ['skyblue', 'lightcoral', 'lightgreen']

for i, metric in enumerate(metrics):
    row = i // 2
    col = i % 2
    ax = axes[row, col]
    bars = ax.bar(comparison_df['Model'], comparison_df[metric], color=colors)
    ax.set_title(f'{metric} Comparison')
    ax.set_ylabel(metric)
    ax.set_ylim(0, 1)

    # Add value labels on bars
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width() / 2., height + 0.01,
                f'{height:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Best performing model (idxmax returns an index label, so look it up with .loc)
best_model_idx = comparison_df['F1-Score (Macro)'].idxmax()
best_model = comparison_df.loc[best_model_idx]
print(f"\nBest Performing Model: {best_model['Model']}")
print(f"F1-Score (Macro): {best_model['F1-Score (Macro)']:.4f}")

# Detailed per-emotion analysis across all models
print("PER-EMOTION PERFORMANCE ACROSS MODELS")
print(f"{'='*50}")

# Get target emotion names from the unique test labels
target_names = [target_label_mapping.get(i, f'Class_{i}') for i in sorted(np.unique(y_test))]

emotion_performance = {}
for emotion in target_names:
    emotion_performance[emotion] = {}
    for result in results:
        if emotion in result['per_class_report']:
            emotion_performance[emotion][result['model_name']] = {
                'precision': result['per_class_report'][emotion]['precision'],
                'recall': result['per_class_report'][emotion]['recall'],
                'f1_score': result['per_class_report'][emotion]['f1-score']
            }

# Create detailed emotion analysis
# (model column is 18 wide so the 15-character "MLP(Neural Net)" label does not touch the next column)
for emotion in target_names:
    print(f"\n{emotion.upper()} Performance:")
    print(f"{'Model':<18}{'Precision':<10}{'Recall':<10}{'F1-Score':<10}")
    print("-" * 50)
    for model_name, metrics in emotion_performance[emotion].items():
        print(f"{model_name:<18}{metrics['precision']:<10.4f}{metrics['recall']:<10.4f}{metrics['f1_score']:<10.4f}")
```

## Conclusion

This study investigated the capability of machine learning models to classify emotional states from acoustic features using the CREMA-D dataset. Of the three approaches evaluated (Random Forest, SVM, and a multilayer perceptron), the MLP consistently demonstrated superior performance in both overall accuracy and per-emotion metrics, highlighting its ability to capture complex non-linear patterns in the PCA-reduced feature space. While the ensemble and kernel-based methods provided competitive results for certain emotions, the neural network offered the most balanced performance across all classes, particularly for challenging emotions such as Disgust, Fear, and Neutral.

These findings suggest that for multi-class emotion recognition from audio, neural networks are better suited to leverage nuanced acoustic features than the standalone or ensemble methods tested here. Future work could explore integrating temporal modeling with recurrent neural networks or transformer-based architectures, combining audio and visual modalities, or experimenting with advanced feature extraction methods to further improve classification accuracy and robustness.
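
The macro-averaged F1-score used to rank the models weights every emotion equally, which is why it also serves as a measure of balanced performance across classes. The following is a minimal, dependency-free sketch of that calculation, not the code used in this study: the helper names `per_class_f1` and `macro_f1` and the toy labels are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); fall back to 0.0 if undefined
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy labels for illustration only (not CREMA-D results)
y_true = ["angry", "angry", "happy", "happy", "sad", "sad"]
y_pred = ["angry", "happy", "happy", "happy", "sad", "angry"]

for label, f1 in per_class_f1(y_true, y_pred).items():
    print(f"{label:<8}{f1:.4f}")
print(f"{'macro':<8}{macro_f1(y_true, y_pred):.4f}")
```

Because each class contributes equally regardless of its support, a model that does well on Angry but poorly on Disgust or Fear is penalized more under macro F1 than under plain accuracy.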